1. INTRODUCTION

Epidemic diseases can be extremely dangerous because of their short-term and long-term effects.
Understanding the factors associated with them and taking precautions in time can therefore help prevent
the spread of disease, save many lives and limit negative consequences. In 2019, an outbreak of a new
coronavirus causing respiratory illness was identified in Wuhan, China. The virus was later seen in other
countries and regions as well [1]. The name "coronavirus" is derived from the Latin corona, meaning
"crown" or "wreath" [2]. As stated by Unhale, coronaviruses are a group of enveloped viruses. They
make up a large family of viruses that can infect birds and mammals, including humans [1]. As Syed indicated,
some of the common symptoms highlighted in the literature are runny nose, headache, cough, sore throat, fever
and a general feeling of being unwell [3]. Human coronaviruses most commonly spread from an infected person
to others through the air by coughing and sneezing, and through close personal contact [3]. Common signs of
infection include fever, cough and respiratory difficulties. Serious cases can even lead to death [3].
Research on the treatment of the virus continues worldwide. Washing hands with soap and water, paying
attention to social distancing, and avoiding touching the eyes, nose or mouth with contaminated hands are
suggested. Specialists are raising awareness of the precautions associated with the disease and applying
several medical approaches to treat the form it takes in human beings [3].


2. RESEARCH METHOD 
The purpose of data mining is to extract knowledge and insights from large amounts of data. In doing so,
a systematic approach is followed [4]. As illustrated in Figure 1, the data mining process is composed of a
set of steps [5]. As indicated by Shearer, these are business understanding, data understanding, data
preparation, model building, testing and evaluation, and deployment [5]. Business understanding refers to the
analysis and elicitation of business needs and characteristics; data understanding is the analysis of the data
to be examined; data preparation is the pre-processing and organization of the data; model building is the
application of modeling approaches, such as classification and clustering algorithms, to the data; testing and
evaluation is the assessment of the different models used and of how well they fit the data; finally,
deployment is the finalization of the data mining analysis results and their delivery to the stakeholders [5].


[Figure: cycle around the data sources: Business Understanding, Data Understanding, Data Preparation,
Model Building, Testing and Evaluation, Deployment]


Figure 1. Data mining cycle 


2.1. Data gathering and processing 

Following the literature review, a research model composed of the variables age group, sex, COVID-19
deaths and state was used for the analysis. The data set was taken from the public records of the United
States (US) Department of Health and Human Services, Centers for Disease Control and Prevention, and is
composed of 1378 instances and four variables, as shown in Table 1 [6]-[9]. Age group refers to the age range,
sex is the biological sex of an individual, COVID-19 deaths is the number of deaths, and state is the
reporting jurisdiction, that is, the states of the United States together with jurisdictions such as Puerto
Rico and New York City that appear in the discovered rules [6]-[9].


Table 1. List of attributes 


Attribute          Type
Age Group          Nominal
Sex                Nominal
COVID-19 Deaths    Numeric
State              Nominal
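
As a minimal sketch, one instance of the data set could be represented with the four attributes of Table 1.
The class name `DeathRecord`, the field names and the example values are hypothetical and chosen only for
illustration; they are not taken from the source data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeathRecord:
    """One instance with the four attributes listed in Table 1 (hypothetical names)."""
    age_group: str       # nominal, an age range label
    sex: str             # nominal
    covid19_deaths: int  # numeric, number of deaths
    state: str           # nominal

    def __post_init__(self):
        # A death count cannot be negative.
        if self.covid19_deaths < 0:
            raise ValueError("covid19_deaths must be non-negative")


# Made-up example values, not real figures from the data set.
example = DeathRecord(age_group="65-74 years", sex="Male",
                      covid19_deaths=10, state="Texas")
print(example)
```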


3. FINDINGS 

In this research, the performances of several data mining algorithms were compared and the rules
discovered by them are reported [10], [11]. The algorithms compared were J48, JRip, PART, OneR,
Multilayer Perceptron and Bayesian networks. In testing the research model with each of the data mining
approaches, 66 percent of the data was used for training and the remaining part of the data set was used
for testing. Among the different data mining approaches, J48 had the values (RMSE=0.35; precision=0.535;
correct classification rate=48.94%; incorrect classification rate=51.05%). JRip had the values (RMSE=0.34;
precision=0.491; correct classification rate=53.29%; incorrect classification rate=46.70%). PART had the
values (RMSE=0.35; precision=0.49; correct classification rate=48.94%; incorrect classification rate=51.05%).
OneR had the values (RMSE=0.54; precision=N/A; correct classification rate=41.16%; incorrect classification
rate=58.83%). Multilayer Perceptron had the values (RMSE=0.50; precision=0.42; correct classification
rate=35.75%; incorrect classification rate=64.24%). Bayesian networks had the values (RMSE=0.37;
precision=N/A; correct classification rate=42.61%; incorrect classification rate=57.38%) [12]-[15].
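
The percentage split and the classification rates described above can be sketched as follows. This is an
illustration under stated assumptions, not the tool used in the study: the function names, the shuffling seed
and the rounding of the split point are hypothetical.

```python
import random


def percentage_split(rows, train_fraction=0.66, seed=1):
    """Shuffle the rows and split them into training and test parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)  # assumed rounding rule
    return shuffled[:cut], shuffled[cut:]


def classification_rates(actual, predicted):
    """Return (correct %, incorrect %) for two equally long label lists."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("label lists must be non-empty and equally long")
    correct = sum(a == p for a, p in zip(actual, predicted))
    correct_pct = 100.0 * correct / len(actual)
    return correct_pct, 100.0 - correct_pct


# 1378 instances, as in the data set; indices stand in for the records.
train, test = percentage_split(range(1378))
print(len(train), len(test))  # 909 469 under the assumed rounding

# Toy labels: three of the four predictions match.
print(classification_rates(["a", "b", "a", "b"], ["a", "b", "b", "b"]))
```
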
Among all the algorithms, JRip had the highest correct classification rate, 53.29%, with a precision
of 0.491. It also had the lowest RMSE, 0.34 [16]-[18]. A comparison of the data mining methods used
can be seen in Table 2. Some of the rules discovered by the applied algorithms are as follows and can be
found in Figure 2 in detail. If the number of deaths is small, the state is Puerto Rico, West Virginia,
Delaware, Tennessee, Alabama, Arizona or Texas; if it is significantly higher, it is New Jersey, New York
City or California. Males are at higher risk than females: within the same categories, women have a lower
coronavirus death rate. Under 14 years of age the coronavirus death rate is not high, and deaths in this
category are mainly male. Over 85 years of age, coronavirus deaths are also mainly male. Coronavirus deaths
are possible for all age groups and sexes. For the age group above 65 years, coronavirus deaths are seen for
all sexes. The risk increases with age.


Table 2. Comparison of the data mining methods 


Method                  RMSE   Precision   Correctly classified %   Incorrectly classified %
J48                     0.35   0.535       48.94                    51.05
JRip                    0.34   0.491       53.29                    46.70
PART                    0.35   0.49        48.94                    51.05
OneR                    0.54   N/A         41.16                    58.83
Multilayer Perceptron   0.50   0.42        35.75                    64.24
Bayesian networks       0.37   N/A         42.61                    57.38
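
The comparison in Table 2 can be checked with a short sketch. The values are copied from the table; the
dictionary layout and the helper names are assumptions made for illustration.

```python
# Values copied from Table 2: method -> (RMSE, precision, correct %, incorrect %).
# None stands for the N/A precision entries.
TABLE_2 = {
    "J48": (0.35, 0.535, 48.94, 51.05),
    "JRip": (0.34, 0.491, 53.29, 46.70),
    "PART": (0.35, 0.49, 48.94, 51.05),
    "OneR": (0.54, None, 41.16, 58.83),
    "Multilayer Perceptron": (0.50, 0.42, 35.75, 64.24),
    "Bayesian networks": (0.37, None, 42.61, 57.38),
}


def best_by_accuracy(table):
    """Method with the highest correctly classified percentage."""
    return max(table, key=lambda method: table[method][2])


def best_by_rmse(table):
    """Method with the lowest RMSE."""
    return min(table, key=lambda method: table[method][0])


print(best_by_accuracy(TABLE_2))  # JRip
print(best_by_rmse(TABLE_2))      # JRip

# The two rates of each method should add up to about 100 percent.
for method, (_, _, correct, incorrect) in TABLE_2.items():
    assert abs(correct + incorrect - 100.0) < 0.05, method
```

Both criteria select JRip, which agrees with the text.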


If the number of deaths is small, the state is Puerto Rico, West Virginia, Delaware,
Tennessee, Alabama, Arizona or Texas; if it is significantly higher, it is New Jersey, New
York City or California

Males are at higher risk than females. Within the same categories, women have a lower
coronavirus death rate

Under 14 years of age the coronavirus death rate is not high, and deaths in this
category are mainly male

Over 85 years of age, coronavirus deaths are mainly male

Coronavirus deaths are possible for all age groups and sexes

For the age group above 65 years, coronavirus deaths are seen for all sexes. The risk
increases with age.


Figure 2. Rules discovered by data mining algorithms 


4. DISCUSSION 

Epidemic diseases can be extremely dangerous because of their short-term and long-term effects.
Understanding the factors associated with them and taking precautions in time can therefore help prevent
the spread of disease, save many lives and limit negative consequences. In 2019, an outbreak of a new
coronavirus disease known as COVID-19 was identified in Wuhan, China. The virus was later seen in other
countries and regions as well. In this research, the factors underlying COVID-19 deaths were examined with
several data mining approaches following an intensive literature review. This was followed by forming the
model and applying the data mining techniques suggested in the literature. In the analysis, the relationships
between the different constructs were examined. For each analysis approach, the model was trained on
66 percent of the data and the remaining part of the data was used for testing [10], [19].

Data mining can be defined as the process of gaining insights and extracting knowledge from data.
Knowledge discovery, prediction or forecasting can be the focus of data mining activities. J48, JRip, PART,
OneR, multilayer perceptron (neural networks) and Bayesian networks were chosen as the data mining
techniques applied [20]-[22]. Among them, JRip is a rule learner similar in principle to the rule learner
Ripper [23]. The PART algorithm combines two common data mining strategies: the divide-and-conquer strategy
for decision tree learning and the separate-and-conquer strategy for rule learning. OneR generates a one-level
decision tree that is expressed as a set of rules that all test one particular attribute. Multilayer
Perceptron is a version of the original perceptron model proposed by Rosenblatt in the 1950s and is considered
a type of neural network [23]-[28]. A perceptron (artificial neuron) computes a function of a weighted
combination of its inputs; in a multilayer perceptron, the inputs are passed through weighted connections to
the hidden-layer perceptrons, whose outputs lead to the output layer. Finally, graphical models such as
Bayesian networks supply a general framework for dealing with uncertainty in a probabilistic setting and are
thus well suited to the problem of prediction [23]-[31].
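
The OneR idea described above can be sketched in a few lines. This is a simplified illustration, not the
implementation used in the study: the function name `one_r`, the toy rows and the `risk` label are
hypothetical, and numeric attributes are assumed to be already discretized.

```python
from collections import Counter, defaultdict


def one_r(rows, attributes, target):
    """Return (attribute, rules, errors) for the single attribute whose
    value -> majority-class rules make the fewest training errors."""
    best = None
    for attr in attributes:
        # Count the class labels seen for each value of this attribute.
        counts = defaultdict(Counter)
        for row in rows:
            counts[row[attr]][row[target]] += 1
        # One rule per attribute value: predict its majority class.
        rules = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        # Errors are the rows that do not belong to the majority class.
        errors = sum(sum(c.values()) - c.most_common(1)[0][1]
                     for c in counts.values())
        if best is None or errors < best[2]:
            best = (attr, rules, errors)
    return best


# Made-up toy rows, used only to exercise the function.
rows = [
    {"sex": "Male", "age_group": "young", "risk": "low"},
    {"sex": "Female", "age_group": "young", "risk": "low"},
    {"sex": "Male", "age_group": "old", "risk": "high"},
    {"sex": "Female", "age_group": "old", "risk": "high"},
    {"sex": "Male", "age_group": "old", "risk": "high"},
]
attr, rules, errors = one_r(rows, ["sex", "age_group"], "risk")
print(attr, rules, errors)
```

On these toy rows the rules on `age_group` make no training errors, whereas rules on `sex` would misclassify
two rows, so `age_group` is selected.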


5. CONCLUSION

In this study, some of the factors related to coronavirus deaths were analyzed using data mining
techniques composed of supervised and unsupervised machine learning approaches. According to the analysis
results, JRip had the highest correct classification rate, 53.29%, with a precision of 0.491. It also had
the lowest RMSE, 0.34. Based on the classification rate and the RMSE, JRip can be considered the most
effective of the compared methods for understanding the factors related to coronavirus deaths.

Some of the highlights discovered with the classification and clustering data mining algorithms are as
follows. If the number of deaths is small, the state is Puerto Rico, West Virginia, Delaware, Tennessee,
Alabama, Arizona or Texas; if it is significantly higher, it is New Jersey, New York City or California.
Males are at higher risk than females: within the same categories, women have a lower coronavirus death
rate. Under 14 years of age the coronavirus death rate is not high, and deaths in this category are mainly
male. Over 85 years of age, coronavirus deaths are also mainly male. Coronavirus deaths are possible for all
age groups and sexes. For the age group above 65 years, coronavirus deaths are seen for all sexes. The risk
increases with age. Of all the algorithms applied, JRip had the highest correct classification rate (53.29%),
a precision of 0.491 and the lowest RMSE (0.34).